Predicting Stroke Risk from Common Health Indicators

Supervisor: Dr. Cohen

1. Introduction

Stroke is one of the leading causes of death and disability worldwide and remains a major public health challenge [1]. Early identification of high-risk individuals is crucial for prevention and timely intervention. We therefore develop and fit a logistic regression model on key health indicators to evaluate how effective a much simpler method can be.

Occam’s Razor: among competing explanations, the simplest is usually preferable.

2. Methodology: Logistic Regression

The Binary Logistic Model

The logistic regression model uses the logit link function to model the probability of the outcome (\(\pi = P[Y = 1]\)):

\[\ln\left(\frac{\pi}{1-\pi}\right) = \beta_{0} + \beta_{1}x_{1} + \cdots + \beta_{k}x_{k}\]

The main characteristic of binary logistic regression is the type of dependent (or outcome) variable [2]: the dependent variable has only two levels.
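As a minimal numeric sketch of the model above (the coefficient values are invented for illustration and are not the fitted estimates from this project), the linear predictor is passed through the inverse logit to obtain a probability:

```python
import math

def inverse_logit(eta: float) -> float:
    """Map a linear predictor eta = b0 + b1*x1 + ... + bk*xk onto a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def stroke_probability(coefs, x):
    """Compute pi = P[Y = 1] from coefficients (b0, b1, ..., bk) and predictors (x1, ..., xk)."""
    eta = coefs[0] + sum(b * xi for b, xi in zip(coefs[1:], x))
    return inverse_logit(eta)

# Hypothetical coefficients: intercept, age, hypertension, avg_glucose_level
coefs = (-7.0, 0.07, 0.5, 0.004)
p = stroke_probability(coefs, (67.0, 1.0, 228.69))
```

Note that the inverse logit maps \(\eta = 0\) to exactly 0.5, which is why the decision boundary of a logistic classifier is linear in the predictors.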

3. Analysis and Results

3.1. Dataset Preprocessing

The Stroke Prediction Dataset [3] initially contained 5,110 observations and 12 features. After removing missing and inconsistent entries and applying other necessary changes, the cleaned dataset contained 3,357 observations and 11 predictors commonly associated with cerebrovascular risk. The key predictors are listed below.

Feature Name | Description | Data Type | Values
gender | Patient’s gender | Numeric | 1 (Male), 0 (Female)
age | Patient’s age in years | Numeric | Range 0.08 to 82; rounded to 2 decimal places
hypertension | Indicates if the patient has hypertension | Numeric | 0 (No), 1 (Yes)
heart_disease | Indicates if the patient has any heart diseases | Numeric | 0 (No), 1 (Yes)
ever_married | Whether the patient has ever been married | Numeric | 1 (Yes), 0 (No)
work_type | Type of occupation | Numeric | 1 (Govt_job), 2 (Private), 3 (Self-employed), 4 (Never_worked)
Residence_type | Patient’s area of residence | Numeric | 1 (Urban), 2 (Rural)
avg_glucose_level | Average glucose level in blood | Numeric | Range ≈55.12 to 271.74
bmi | Body Mass Index | Numeric | Range ≈10.3 to 97.6; converted from character, rounded to 2 decimals
smoking_status | Patient’s smoking status | Numeric | 1 (never smoked), 2 (formerly smoked), 3 (smokes)
stroke | Target Variable: Whether the patient had stroke | Numeric | 0 (No Stroke), 1 (Stroke)
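A pandas sketch of the kind of recoding the table describes (the column names and code mappings follow the table; the exact cleaning steps of the original project are assumptions):

```python
import pandas as pd

# Toy stand-in rows; the real file comes from the Kaggle source [3]
raw = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "ever_married": ["Yes", "No", "Yes"],
    "Residence_type": ["Urban", "Rural", "Urban"],
    "bmi": ["28.1", "N/A", "33.47"],
})

df = raw.copy()
# Recode categorical strings to the numeric codes listed in the table
df["gender"] = df["gender"].map({"Male": 1, "Female": 0})
df["ever_married"] = df["ever_married"].map({"Yes": 1, "No": 0})
df["Residence_type"] = df["Residence_type"].map({"Urban": 1, "Rural": 2})
# bmi arrives as character data; coerce to numeric, round to 2 decimals, drop unparseable rows
df["bmi"] = pd.to_numeric(df["bmi"], errors="coerce").round(2)
df = df.dropna(subset=["bmi"])
```

Dropping rows with unparseable `bmi` values is one plausible way the observation count falls from 5,110 toward 3,357; the project's actual filters may differ.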

3.2. Dataset Visualization

We can observe from the histograms (a), (b), (c) and (d) the following:

  • The data is slightly skewed toward female patients, but the proportion of stroke cases appears similar for both genders.
  • The number of stroke cases increases after the age of \(\approx 50\) and peaks in the 60 to 80 age range.
  • The proportion of stroke cases (blue bar) is visibly much higher in the group with hypertension.
  • The proportion of stroke cases (blue bar) is visibly much higher in the group with heart disease.

Histogram of (a) gender, (b) age, (c) hypertension, (d) heart_disease.
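The per-group stroke proportions that the blue bars in these histograms depict can be computed directly; a pandas sketch on toy rows standing in for the cleaned dataset:

```python
import pandas as pd

# Toy rows; column names follow the dataset table above
df = pd.DataFrame({
    "hypertension": [0, 0, 0, 1, 1, 0],
    "stroke":       [0, 0, 1, 1, 0, 0],
})

# Proportion of stroke cases within each hypertension group --
# the quantity the stacked blue bars visualize
prop = df.groupby("hypertension")["stroke"].mean()
```

Because `stroke` is coded 0/1, the group mean is exactly the within-group stroke proportion; the same one-liner reproduces the comparison for any of the categorical predictors.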

We can observe from the histograms (e), (f), (g) and (h) the following:

  • Having been married is associated with higher stroke risk in this dataset, possibly because the married group skews toward older ages.
  • Self-employed individuals appear to have the highest risk proportion among the working groups, possibly due to reduced access to healthcare or benefits.
  • Residence type does not appear to be a significant factor for stroke risk.
  • A high average blood glucose level is a significant risk factor for stroke.

Histogram of (e) ever_married, (f) work_type, (g) Residence_type, (h) avg_glucose_level.

We can observe from the histograms (i) and (j) the following:

  • Most of the patient population (pink bars) falls within the overweight-to-obese range (BMI \(\approx 25\) to \(35\)), but the proportion of stroke outcomes is much smaller within the healthy BMI range.
  • The highest proportional risk of stroke appears in the formerly smoked group; this finding is common in the medical literature [4].

Histogram of (i) bmi, (j) smoking_status.

3.3. Statistical Modelling

There is a massive increase in the \(\chi^2\) values, which shows that the oversampling technique has significantly increased the statistical power of the model.

Factor | LR χ² (Original, anova2) | LR χ² (Balanced, anova3) | Change in χ²
age | 120.407 | 1201.85 | ≈10.0× increase
hypertension | 18.205 | 154.34 | ≈8.5× increase
avg_glucose_level | 11.337 | 69.73 | ≈6.1× increase
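The likelihood-ratio χ² for a single factor is twice the log-likelihood gap between the models fitted with and without that factor, and for 1 degree of freedom the p-value has a closed form. A stdlib-only sketch (the log-likelihood values below are hypothetical, not the project's):

```python
import math

def lr_chi2(ll_full: float, ll_reduced: float) -> float:
    """Likelihood-ratio statistic: 2 * (logLik(full) - logLik(reduced))."""
    return 2.0 * (ll_full - ll_reduced)

def chi2_sf_df1(x: float) -> float:
    """Survival function of the chi-square distribution with 1 df: P(X > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical log-likelihoods from fitting with and without one factor
stat = lr_chi2(ll_full=-820.0, ll_reduced=-880.2)
p_value = chi2_sf_df1(stat)
```

This is the computation behind each row of the anova tables; note that oversampling enlarges the effective sample, which is why every χ² in the balanced column grows.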

  • The imbalanced models achieved high accuracy (\(\approx 94.5\%\)) and specificity (\(\approx 0.998\)) but are practically useless for stroke prediction: with near-zero sensitivity they miss almost all actual stroke cases.

  • Model 3, which used oversampling to address the severe class imbalance in the stroke outcome, demonstrated a significant improvement in predictive capability: sensitivity improved dramatically to \(0.6481\), identifying 35 true positives.
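The project balanced the classes with ROSE in R; a minimal stdlib Python sketch of plain random oversampling of the minority class (ROSE additionally generates synthetic points, which this sketch does not):

```python
import random

def random_oversample(rows, label_index=-1, seed=42):
    """Duplicate minority-class rows at random until all classes have equal counts."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_index], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)                              # keep every original row
        balanced.extend(rng.choices(group, k=target - len(group)))  # pad with resampled duplicates
    return balanced

# Toy data: (age, stroke) with a 4:1 imbalance
data = [(70, 1), (34, 0), (51, 0), (62, 0), (45, 0)]
balanced = random_oversample(data)
```

Oversampling must be applied to the training split only; duplicating rows before the train/test split would leak copies of test observations into training.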

Metric | Model 1 (Full, Imbalanced) | Model 2 (Reduced, Imbalanced) | Model 3 (Reduced, Balanced)
Accuracy | 0.94538 | 0.9444 | 0.7269
Sensitivity (Recall) | 0.01852 | 0.0000 | 0.6481
Specificity | 0.99790 | 0.9979 | 0.7314
True Positives (TP) | 1 | 0 | 35
False Negatives (FN) | 53 | 54 | 19
True Negatives (TN) | 951 | 951 | 697
False Positives (FP) | 2 | 2 | 256
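These metrics follow directly from the confusion-matrix counts; a quick arithmetic check against Model 3's column:

```python
def confusion_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Accuracy, sensitivity (recall), and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,       # correct predictions over all predictions
        "sensitivity": tp / (tp + fn),       # share of actual stroke cases caught
        "specificity": tn / (tn + fp),       # share of non-stroke cases correctly cleared
    }

# Model 3 (Reduced, Balanced)
m3 = confusion_metrics(tp=35, fn=19, tn=697, fp=256)
```

The same function applied to Model 1's counts reproduces its near-perfect specificity and near-zero sensitivity, making the accuracy paradox on imbalanced data explicit.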

4. Conclusion

Logistic regression, although very simple, provides an interpretable baseline model for stroke risk prediction. During the project we were able to evaluate its weaknesses when analysing a heavily imbalanced dataset such as the Stroke Prediction Dataset. Furthermore, a major weakness of logistic regression models is that they cannot establish causal relationships [7].

We found that addressing the class imbalance via oversampling (ROSE) was critical for achieving a model that can somewhat successfully predict stroke outcomes. Even so, this is nowhere near the precision and accuracy of ensemble modelling such as the Dense Stacking Ensemble (DSE) model applied in [8], which used a meta-classifier to combine the strengths of simpler models with higher-performing complex models.

References

1. World Health Organization. (2025). The top 10 causes of death. https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death.
2. Harris, J. K. (2019). Statistics with R: Solving problems using real-world data. SAGE Publications.
3. Palacios, F. S. (n.d.). Stroke Prediction Dataset. https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset
4. Oshunbade, A. A., Yimer, W. K., Valle, K. A., Clark III, D., Kamimura, D., White, W. B., DeFilippis, A. P., Blaha, M. J., Benjamin, E. J., O’Brien, E. C., et al. (2020). Cigarette smoking and incident stroke in Blacks of the Jackson Heart Study. Journal of the American Heart Association, 9(12), e014990.
5. Gomila, R. (2021). Logistic or linear? Estimating causal effects of experimental treatments on binary outcomes using regression analysis. Journal of Experimental Psychology: General, 150(4), 700.
6. Gelman, A., & Hill, J. (2007). Causal inference using regression on the treatment variable. In Data Analysis Using Regression and Multilevel/Hierarchical Models (Vol. 2006, pp. 167–194). Cambridge University Press New York, NY.
7. Allison, P. (2014). Prediction vs. causation in regression analysis. Statistical Horizons.
8. Hassan, A., Gulzar Ahmad, S., Ullah Munir, E., Ali Khan, I., & Ramzan, N. (2024). Predictive modelling and identification of key risk factors for stroke using machine learning. Scientific Reports, 14(1), 11498.